Abstract

Background: Divergence in gene structure following gene duplication is not well understood. Gene duplication 
can occur via whole-genome duplication (WGD) and single-gene duplications including tandem, proximal and 
transposed duplications. Different modes of gene duplication may be associated with different types, levels, and 
patterns of structural divergence. 

Results: In Arabidopsis thaliana, we denote levels of structural divergence between duplicated genes by differences 
in coding-region lengths and average exon lengths, and the number of insertions/deletions (indels) and maximum 
indel length in their protein sequence alignment. Among recent duplicates of different modes, transposed 
duplicates diverge most dramatically in gene structure. In transposed duplications, parental loci tend to have longer 
coding-regions and exons, and smaller numbers of indels and maximum indel lengths than transposed loci, 
reflecting biased structural changes in transposed duplications. Structural divergence increases with evolutionary 
time for WGDs, but not transposed duplications, possibly because of biased gene losses following transposed 
duplications. Structural divergence has heterogeneous relationships with nucleotide substitution rates, but is 
consistently positively correlated with gene expression divergence. The NBS-LRR gene family shows higher-than-average
levels of structural divergence.

Conclusions: Our study suggests that structural divergence between duplicated genes is greatly affected by the 
mechanisms of gene duplication and may not be proportional to evolutionary time, and that certain gene families
are under selection for rapid evolution of gene structure.
Background

Gene duplication is an important mechanism for evolution
of functional novelty and increase of genome complexity
[1]. Gene duplication may occur by different
modes such as whole-genome duplication (WGD) [2] 
and single-gene duplications [3-5]. For example, 
Arabidopsis thaliana has experienced at least three 
WGD events: two recent events (α and β) since its divergence
from other members of the Brassicales clade
and a more ancient event (γ) shared with most if not all
eudicots [6]. Single-gene duplications including local 
(tandem or proximal) and dispersed duplications also 
contribute to the origin of a substantial portion of 
Arabidopsis genes [5,7,8]. Transposed gene duplications, 
which relocate duplicated genes to new chromosomal
positions via either DNA or RNA-based mechanisms 
[7,9], may contribute to the widespread existence of dis- 
persed duplicates in the Arabidopsis genome [5,7]. 

Since a likely consequence of gene duplication is rever- 
sion to single copy (singleton) status [1], mechanisms 
for the retention of duplicated genes have been exten- 
sively studied. The 'neo-functionalization' model sug- 
gests that each of two duplicated genes can be retained 
if at least one evolves modified or novel functions [1].
The 'sub-functionalization' model suggests that both du- 
plicated genes can be preserved if they partition the 
functions of their ancestor, through accumulation of de- 
generative mutations [10,11]. More recent models for 
gene retention include genetic buffering [12], functional 
redundancy [13-15], dosage balance constraints 
[5,16,17], or need for enhanced expression levels [18,19]. 

Retention of duplicated genes does not occur randomly.
Following duplication, genes belonging to some
functional categories have been preferentially restored to 
singleton status across different eukaryotic lineages [20]. 
In plants, modes of gene duplication retain genes in a 
biased manner [5]. Genes related to transcription factors, 
protein kinases, and ribosomal proteins are preferentially 
retained following WGDs [4,21], while those genes related 
to abiotic and biotic stress are more likely to be retained 
following local duplications [22,23]. Gene transpositions 
are more frequent in some families such as F-box, MADS- 
box, NBS-LRR, and defensins than others [5,8]. 

Evolutionary consequences following different modes 
of gene duplication have been widely investigated. Dupli- 
cated genes retained from WGDs show lower levels of ex- 
pression divergence [24-27], functional innovation [28,29], 
network rewiring [29,30] and epigenetic changes [31] than 
single-gene duplicates. Moreover, among single-gene du- 
plications, transposed duplicates tend to evolve faster than 
tandem or proximal duplicates [25-27,31]. 

Functional divergence between duplicated genes was 
presumed to be driven by nucleotide substitutions in- 
cluding enhancer/promoter mutations, and non- 
synonymous and synonymous substitutions [24-27]. 
However, insertions/deletions (indels) between dupli- 
cated genes, which may cause shifts of reading frame 
[32], have greater effects on the divergence in protein 
secondary structures [33-35]. In addition, duplicated 
genes also diverge in exon-intron structures following 
gene duplication, which was suggested to play an im- 
portant role during the evolution of duplicated genes 
[36]. These facts, taken together, suggest that divergence 
in gene structures such as exon configuration and indels 
may also drive the functional divergence between dupli- 
cated genes. 

In this paper, we study structural divergence between 
duplicated genes in Arabidopsis thaliana. We describe
levels of structural divergence between duplicated genes 
using four different measures. Structural divergence is 
compared among different modes of gene duplication 
including WGD, and tandem, proximal and transposed 
duplications, and then related to duplication epochs, nu- 
cleotide substitutions and expression divergence. Evolu- 
tionary mechanisms for gene-structure divergence are 
also investigated. 

Results 

Comparison of structural divergence among different 
modes of gene duplication 

Modes of gene duplication in Arabidopsis were classified 
into WGD (α, β and γ events) and tandem, proximal and
transposed (<16 Mya, i.e. after Arabidopsis-Brassica diver- 
gence, and 16-107 Mya, i.e. between Arabidopsis-Brassica 
and Arabidopsis-Populus divergence) duplications, as
described in Methods. Divergence between duplicated
genes often increases with duplication age [24,26,27]. 
To compare the evolutionary effects of different modes 
of gene duplication, it may be helpful to take duplication 
age into account. Here, synonymous (Ks) substitution 
rates are used as a rough proxy of duplication age. The 
Ks distributions of different modes of gene duplication 
are shown in Figure 1. The duplicated genes belonging to 
α WGD, tandem duplication, proximal duplication and
transposed duplication after Arabidopsis-Brassica divergence
(<16 Mya) are relatively younger than those belonging
to β and γ WGDs and transposed duplication between
Arabidopsis-Brassica and Arabidopsis-Populus divergence 
(16-107 Mya). Thus, to compare structural divergence 
among different modes of gene duplication, we restricted 
WGD duplicates to those retained from the α event, and
transposed duplications to those that occurred after 
Arabidopsis-Brassica divergence (<16 Mya). 

Structural divergence between duplicated genes was 
measured by differences in coding-region lengths and 
average exon lengths, and the number of indels and 
maximum indel length in their protein sequence align- 
ment. Comparison of structural divergence among dif- 
ferent modes of gene duplication is shown in Figure 2. 
When measured by differences in coding-region lengths 
and average exon lengths and the maximum indel length, 
structural divergence between duplicated genes shows the 
following trend: WGD < tandem < proximal < transposed 
(comparisons between consecutive gene duplication 
modes are significant at α = 0.05, Wilcoxon test). When
measured by the number of indels, structural divergence
between duplicated genes follows a slightly different trend:
tandem < proximal < WGD < transposed (comparisons between
consecutive gene duplication modes are significant
at α = 0.05, Wilcoxon test). These comparisons, taken together,
suggest that transposed duplications diverge more
dramatically in gene structure than any other mode of
gene duplication.

Figure 1 Ks distributions of different modes of gene duplication.

Figure 2 Comparison of structural divergence among different modes of gene duplication. To minimize the effects of duplication age,
WGD duplicates were restricted to those retained from the α event, and transposed duplications were restricted to those that occurred after
Arabidopsis-Brassica divergence (<16 Mya).
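
The two length-based measures can be illustrated with a minimal sketch. It assumes that each gene is described by the lengths of its coding exons in base pairs and that divergence is the absolute difference between the two copies; the `Gene` container and the function name are hypothetical and are not taken from the study's pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gene:
    """Hypothetical container: lengths (bp) of the coding exons of one gene."""
    exon_lengths: List[int]

    @property
    def coding_length(self) -> int:
        return sum(self.exon_lengths)

    @property
    def mean_exon_length(self) -> float:
        return self.coding_length / len(self.exon_lengths)


def length_divergence(a: Gene, b: Gene) -> Tuple[int, float]:
    """Absolute differences in coding-region length and in average exon length."""
    return (abs(a.coding_length - b.coding_length),
            abs(a.mean_exon_length - b.mean_exon_length))


# Toy pair: a three-exon gene and a single-exon copy.
parent = Gene([300, 150, 450])   # 900 bp, mean exon length 300
copy = Gene([600])               # 600 bp, mean exon length 600
print(length_divergence(parent, copy))  # (300, 300.0)
```

The toy pair mimics an intronless copy that lost part of its coding region, so both measures are non-zero.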

Transposed duplications are often associated with biased 
changes in gene structure 

In transposed duplications, duplicated genes are transposed
from ancestral (parental) loci to novel (transposed)
loci [7]. Transposed duplications may occur via
DNA or RNA-based mechanisms, and the latter mech- 
anism, often referred to as retrotransposition, creates 
intronless retrocopies [9]. Comparison of gene structure
between parental and transposed loci may help to better
understand the genetic mechanisms and evolutionary ef- 
fects of transposed duplications. We note that in this 
analysis we computed numbers of indels and maximum 
indel lengths for parental and transposed duplicates sep- 
arately. We found that parental loci generally have lon- 
ger coding-regions and exons, and fewer indels with 
smaller maximum indel lengths than transposed loci 
(Figure 3), suggesting that transposed duplications tend 
to be associated with biased changes in gene structure. 
In other words, transposed duplication is a singular
mode of gene duplication in which gene structure not
only undergoes intensive changes but also is biased toward
smaller gene size and complexity. A trend toward
shorter exons, more indels and larger maximum indel
lengths suggests that transposed duplications are not
perfectly copied and that losses of DNA segments happen
frequently. This trend is contrary to the classical theory
that duplicated genes are fully redundant immediately
following gene duplication [1] but consistent with the
observation that various types of transposable elements
frequently duplicate only gene fragments [37,38].

Figure 3 Percentages of different relationships (greater, equal or less) of structural features between the parental copy (P) and the
transposed copy (T) in transposed duplications.

Structural divergence and duplication epochs 

To understand how structural divergence between dupli- 
cated genes changes over evolutionary time, we com- 
pared structural divergence among different epochs of 
gene duplications for WGDs (i.e. among α, β and γ
events) and transposed duplications (i.e. between those 
occurring <16 Mya and 16-107 Mya). Figure 4 shows 
that the structural divergence between WGD duplicates, 
based on all measures, consistently increases across α, β
and γ events; however, for transposed duplications, only
the number of indels increases from <16 Mya to 16-107
Mya. Moreover, transposed duplications show a decrease 
of maximum indel lengths from <16 Mya to 16-107 
Mya. Compared with WGDs, transposed duplications 
have a higher rate of gene losses, evidenced by an "L"-shaped
distribution of duplication age [11]. It is possible
that the different patterns of change in structural divergence
over evolutionary time between WGDs and transposed
duplications are determined by the biased, high
rate of gene losses associated with transposed duplications,
e.g. those duplicates that experienced extreme
structural changes are less likely to survive over long periods
of evolutionary time than those that experienced
more moderate structural changes. It is also worth mentioning
that transposed duplicates that have been preserved
for long times (16-107 Mya) still show higher
structural divergence than WGD duplicates retained
from the ancient γ event that occurred ~117 Mya.

Structural divergence and nucleotide substitutions 

For duplicated genes, structural divergence and nucleo- 
tide substitution are two major types of sequence diver- 
gence [36]. We compared non-synonymous substitution 
rates (Ka) among different epochs of gene duplication 
within WGDs and transposed duplications, and found 
the following trend: α WGD < β WGD < transposed (<16
Mya) < γ WGD < transposed (16-107 Mya) (comparisons
between consecutive gene groups are significant at α =
0.05, Wilcoxon test). However, structural divergence of
recent transposed duplications (<16 Mya) tends to be
higher (except when measured by numbers of indels)
than that of γ WGD (Figure 4), suggesting that gene
structure can diverge much faster than nucleotide
substitutions accumulate.

Figure 4 Comparison of structural divergence between different epochs of gene duplications within WGDs and transposed duplications.

To further understand the relationships between structural
divergence and nucleotide substitutions, we computed
Pearson's correlations between the four
measures of structural divergence and nucleotide substitution
rates (Ka, Ks and Ka/Ks), based on all duplicated
genes regardless of their modes (Table 1).
Differences in coding-region lengths are significantly,
positively correlated with Ka and Ka/Ks, indicating that
the evolution of gene lengths is related to selection. Differences
in average exon lengths are also positively, but
more moderately, correlated with Ka and Ka/Ks, indicating
that the evolution of exon lengths is also related to
selection. However, the number of indels is more closely
related to Ks than to Ka or Ka/Ks, indicating that
indels occur more or less randomly between duplicated
genes. The correlations between maximum indel lengths
and nucleotide substitution rates are generally trivial,
perhaps because duplicated genes losing long coding
segments are preferentially lost following duplication.
Structural divergence between duplicated genes was
previously suggested to occur more or less randomly, i.e.
correlated with evolutionary time [36]. However, we
show that structural divergence between duplicated
genes is related to both neutral evolution and selection,
indicating that structural divergence between duplicated
genes is a complicated process subject to both intrinsic
and extrinsic factors.

Table 1 Correlations between structural divergence and nucleotide substitution rates for all duplicate gene pairs

Measure of structural divergence     Correlation (P-value) with
                                     Ka                      Ks                        Ka/Ks
Difference in coding-region lengths  0.425 (0)               -0.175 (1.841 × 10^-75)   0.525 (0)
Difference in average exon lengths   0.250 (0)               0.018 (0.067)             0.133 (0)
Number of indels                     0.095 (0)               0.110 (0)                 -0.005 (0.619)
Maximum indel length                 0.040 (3.316 × 10^-5)   -0.016 (0.101)            0.035 (2.169 × 10^-4)

Structural divergence and gene expression divergence 

Expression divergence between duplicated genes is pre- 
sumed to be determined by their genetic divergence 
such as regulatory sequence and coding sequence diver- 
gence. Indeed, expression divergence between duplicated 
genes was previously shown to be slightly correlated 
with Ka and/or Ks [24-26]. To date, it is unclear whether 
structural divergence between duplicated genes also 
affects their expression divergence. We computed the 
Pearson's correlations between the four measures for 
structural divergence and expression divergence based 
on the pooled modes of gene duplication (Table 2). All 
four measures of structural divergence are positively 
correlated with expression divergence, indicating that 
structural divergence between duplicated genes is related 
to expression divergence. This analysis suggests that to 
study the genetic mechanisms for expression evolution 
between homologs, it is useful to look into changes in 
their gene structures. 

Table 2 Correlations between structural divergence and
gene expression divergence for all duplicate gene pairs

Measure of structural divergence     Correlation  P-value
Difference in coding-region lengths  0.130        0
Difference in average exon lengths   0.076        3.46 × 10^-11
Number of indels                     0.060        1.561 × 10^-7
Maximum indel length                 0.124        0

The NBS-LRR gene family shows higher-than-average
structural divergence

The NBS-LRR genes have experienced frequent gene
transposition in Arabidopsis [8]. As we have shown that
transposed duplications tend to result in dramatic and
biased changes in gene structure, we hypothesized
that the structural divergence between
duplicated genes belonging to the NBS-LRR family is
higher than the genome average. We computed the
average structural divergence between duplicated genes
belonging to the NBS-LRR family and compared it to
that of the whole set of gene duplications using a t-test
(Table 3). The NBS-LRR gene family indeed shows
higher-than-average structural divergence based on all
four measures, suggesting that certain gene families may
be under selection for rapid evolution of gene
structure.

Discussion 

Ks increases approximately linearly with time only for 
relatively low levels of sequence divergence [39], mean- 
ing that there is great uncertainty in using Ks to repre- 
sent evolutionary time. Thus, to ensure more accurate 
analyses, we did not use the correlation between struc- 
tural divergence and Ks to investigate how structural di- 
vergence changes over time. Patterns of gene colinearity 
conservation within and between genomes can be used 
to estimate the epochs for WGDs and gene transposi- 
tions as previously described [6,40,41]. After assigning 
different epochs to gene duplication modes, we used 
their Ks distributions only for confirming the order of 
their relative ages. 

Classical population genetic theories suggest that du- 
plicated genes have identical sequences immediately fol- 
lowing duplication, and then gradually diverge over 
evolutionary time [1]. The observation that structural di- 
vergence between WGD duplicates increases with time 
is consistent with this classical theory. Due to the fact 
that most tandem/proximal duplicates are relatively 
younger than the most recent, Arabidopsis-specific α
WGD (Figure 1), comparison between different epochs
of tandem/proximal duplications is not feasible in this
work. However, the observation that transposed duplica- 
tions show dramatic and biased structural changes is in- 
consistent with the classical theory - but consistent with 
the observation that various types of transposable ele- 
ments frequently only duplicate gene fragments [37,38]. 

The observation that there is a decrease of maximum 
indel lengths between the transposed duplications that 
occurred <16 Mya and 16-107 Mya suggests that struc- 
tural divergence between duplicated genes may not be 
proportional to evolutionary time. More variations in 
maximum indel lengths in recently transposed genes 
could indicate that many transposed duplicates are 
essentially pseudogenes and not performing important 
functions [37], mixed in with the few that confer a strik- 
ing, adaptive change that may render them finally 
preserved. However, it should be noted that the striking 
structural changes that are beneficial still require the intactness
of key biological functions, and the transposed
genes with extreme structural changes seldom survive
over long evolutionary time.

Table 3 Comparison of structural divergence of duplicated genes between the NBS-LRR gene family and all duplicate
gene pairs

Measure of structural divergence     NBS-LRR gene family mean  Population mean  t-test  P-value
Difference in coding-region lengths  119.7                     76.0             2.862   4.984 × 10^-3
Difference in average exon lengths   255.8                     164.8            2.902   4.434 × 10^-3
Number of indels                     12.3                      7.0              9.410   5.367 × 10^[illegible]
Maximum indel length                 93.9                      51.6             3.336   1.139 × 10^-3

This study reveals that structural divergence between 
duplicated genes, measured in different ways, shows dif- 
ferent patterns depending on modes of gene duplication, 
and can be affected by both neutral evolution and selec- 
tion. Changes in gene structure between duplicated 
genes involve not only alteration of exon-intron struc- 
ture [36,42] and gain/loss of introns [43], but also gain/ 
loss of DNA segments within coding-regions [37,38] 
which occurs more extensively in transposed duplica- 
tions. Certainly there can be more measures to describe 
structural divergence between duplicated genes, and new 
biological insights can be generated based on novel mea- 
sures for structural divergence. For duplicated genes, 
structural divergence seems more complicated than nu- 
cleotide substitutions. Future studies toward better un- 
derstanding of the evolutionary mechanisms for gene 
structure changes are necessary. 

Conclusions 

In this work, we investigated structural divergence be- 
tween Arabidopsis duplicated genes. We found that 
transposed duplicates diverge more dramatically in gene 
structure than genes duplicated by other modes, and 
that the structural changes in transposed duplications 
are biased toward shorter length and lower complexity. 
Structural divergence increases with evolutionary time 
for WGDs, but not transposed duplications, possibly be- 
cause genes experiencing severe changes are preferen- 
tially lost. Structural divergence between duplicated 
genes is related to nucleotide substitution rates in differ- 
ent manners, but consistently positively correlated with 
expression divergence. The NBS-LRR gene family shows 
higher-than-average levels of structural divergence. This 
study suggests that structural divergence between dupli- 
cated genes, greatly affected by the mechanisms of gene 
duplication, may not be proportional to evolutionary
time, and that certain gene families are under selection
for rapid evolution of gene structure.

Methods 

Genome annotations 

Genome annotations for Arabidopsis thaliana, Brassica
rapa, Populus trichocarpa and Vitis vinifera were
obtained from Phytozome v8.0 (http://www.phytozome.net).
For genes with multiple transcripts, only the longest
transcript was used in related analyses.
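
The transcript filter can be sketched as follows. The record layout of (gene ID, transcript ID, length) tuples, the identifiers and the function name are assumptions made for illustration; the source does not say how ties between equally long transcripts were resolved, so this sketch keeps the first one seen.

```python
from typing import Dict, Iterable, Tuple


def longest_transcripts(records: Iterable[Tuple[str, str, int]]) -> Dict[str, str]:
    """Map each gene ID to the ID of its longest transcript.

    `records` yields (gene_id, transcript_id, length) tuples; on ties the
    transcript seen first is kept (an assumption, not stated in the text).
    """
    best: Dict[str, Tuple[str, int]] = {}
    for gene_id, transcript_id, length in records:
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (transcript_id, length)
    return {gene: tid for gene, (tid, _) in best.items()}


# Made-up identifiers and lengths, for illustration only.
example = [
    ("geneA", "geneA.1", 1200),
    ("geneA", "geneA.2", 1500),
    ("geneB", "geneB.1", 900),
]
print(longest_transcripts(example))  # {'geneA': 'geneA.2', 'geneB': 'geneB.1'}
```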

Identification of gene duplication modes in Arabidopsis 

Transposable element-related genes in Arabidopsis were 
excluded from analysis. Arabidopsis WGD duplicates 
were initially obtained from a previous study [6]. Then, 
α WGD duplicates were updated according to another
study [44], to exclude tandemly duplicated WGD duplicates,
which were shown to have evolutionary patterns
very similar to those of tandem duplicates [45]. The WGD
duplicate pairs included 3181 α, 1451 β and 521 γ pairs.
Other modes of gene duplication were identified from
the BLASTP result [46] of the Arabidopsis thaliana genome
(E-value < 10^-10 and top five non-self hits for each
gene). A total of 2130 tandem and 784 proximal duplications
were obtained based on the following criteria: tandem
duplications were BLASTP hits to consecutive
genes in the genome; proximal duplications were
BLASTP hits to nearby genes in the genome interrupted
by fewer than ten non-paralogous genes.
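
The tandem and proximal criteria can be sketched as below, assuming each gene is indexed by its chromosome and its rank along that chromosome. The function name, the input dictionary and the `max_between` parameter are hypothetical. As a simplification, every intervening gene is counted, whereas the criterion above counts only non-paralogous ones.

```python
from typing import Dict, Optional, Tuple

# Hypothetical input: gene ID -> (chromosome, rank of the gene along it).
Position = Tuple[str, int]


def classify_local(a: str, b: str, pos: Dict[str, Position],
                   max_between: int = 10) -> Optional[str]:
    """Label a BLASTP hit pair as 'tandem', 'proximal', or None (not local).

    Simplification: all intervening genes are counted, not only the
    non-paralogous ones.
    """
    chrom_a, rank_a = pos[a]
    chrom_b, rank_b = pos[b]
    if a == b or chrom_a != chrom_b or rank_a == rank_b:
        return None
    between = abs(rank_a - rank_b) - 1   # genes lying between the two hits
    if between == 0:
        return "tandem"
    if between < max_between:
        return "proximal"
    return None


pos = {"g1": ("Chr1", 10), "g2": ("Chr1", 11),
       "g3": ("Chr1", 15), "g4": ("Chr2", 11)}
print(classify_local("g1", "g2", pos))  # tandem
print(classify_local("g1", "g3", pos))  # proximal (4 genes in between)
print(classify_local("g1", "g4", pos))  # None (different chromosomes)
```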

To identify Arabidopsis transposed duplications, 
WGD duplicate pairs and tandem and proximal duplica- 
tions were removed from the BLASTP result. In 
Arabidopsis, ancestral loci were the colinear genes be- 
tween Arabidopsis and its outgroups (related genomes 
showing colinearity with Arabidopsis), and the non- 
colinear genes were deemed to be novel loci. 
Arabidopsis transposed duplications were the BLASTP 
hits consisting of an ancestral chromosomal locus and a 
novel locus. Note that based on different sets of 
outgroups, transposed duplications that occurred within 
different epochs can be inferred [40,41]. Using Brassica 
rapa, Populus trichocarpa and Vitis vinifera as outgroups, 
we identified 1701 transposed duplications which 
occurred after Arabidopsis-Brassica divergence, i.e. <16 
Million years ago (Mya). Using Populus trichocarpa and 
Vitis vinifera as outgroups, we identified 2731 transposed 
duplications which occurred after Arabidopsis-Populus 
divergence, i.e. <107 Mya. By subtraction of the above two 
sets of transposed duplications, the remaining 1862 transposed
duplications were inferred to have occurred between
Arabidopsis-Brassica and Arabidopsis-Populus
divergence, i.e. 16-107 Mya. Arabidopsis duplicated genes
of different modes are listed in Additional file 1.

Wang et al. BMC Genomics 2013, 14:652
http://www.biomedcentral.com/1471-2164/14/652
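
The inference of transposed duplications and their epochs can be sketched as below. The sketch assumes that colinearity calls are already available as sets of gene IDs; the function name, the variable names and the toy data are hypothetical. A pair is kept only when exactly one member is colinear with an outgroup, and pairs found only with the smaller outgroup set are assigned to the older epoch.

```python
from typing import Iterable, List, Set, Tuple


def transposed_pairs(hits: Iterable[Tuple[str, str]],
                     colinear: Set[str]) -> List[Tuple[str, str]]:
    """Return (parental, transposed) pairs from BLASTP hit pairs.

    `colinear` holds genes that are colinear with at least one outgroup
    (ancestral loci); every other gene is treated as a novel locus.
    Pairs with two ancestral or two novel loci are skipped.
    """
    out = []
    for a, b in hits:
        if (a in colinear) != (b in colinear):
            out.append((a, b) if a in colinear else (b, a))
    return out


hits = [("g1", "g2"), ("g3", "g4")]
# Toy colinearity calls: g4 is colinear with Brassica only.
with_brassica = {"g1", "g3", "g4"}   # outgroups: Brassica, Populus, Vitis
without_brassica = {"g1", "g3"}      # outgroups: Populus, Vitis

recent = set(transposed_pairs(hits, with_brassica))           # <16 Mya
since_populus = set(transposed_pairs(hits, without_brassica))  # <107 Mya
older = since_populus - recent                                 # 16-107 Mya
print(sorted(recent), sorted(older))  # [('g1', 'g2')] [('g3', 'g4')]
```

In the toy data, g4 sits at a locus shared with Brassica but not with Populus or Vitis, so its transposition is placed between the two divergence events.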

Indels between duplicated genes 

The protein sequences of two duplicated genes were 
aligned using ClustalW [47] with default parameters. The
ClustalW alignment was then transformed to a "fasta"
format alignment, in which gaps, i.e. consecutive '-' characters,
were deemed to be indels.
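
Counting indels from a gapped alignment can be sketched as below. It assumes the two aligned rows have already been read from the FASTA-format alignment into strings; the function name and the toy sequences are hypothetical. Each maximal run of '-' in either row is counted as one indel.

```python
import re
from typing import Tuple


def indel_stats(aligned_a: str, aligned_b: str) -> Tuple[int, int]:
    """Count indels and find the longest one in a pairwise protein alignment.

    Both arguments are rows of the same alignment (equal length), with '-'
    marking gap positions. Each maximal run of '-' in either row is one indel.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    runs = [m.group() for row in (aligned_a, aligned_b)
            for m in re.finditer(r"-+", row)]
    return len(runs), max((len(r) for r in runs), default=0)


a = "MKT--LLVAG-SA"
b = "MKTAALL---CSA"
print(indel_stats(a, b))  # (3, 3)
```

Applying the same run-finding step to a single row gives per-copy counts, which is one way to compare parental and transposed copies separately.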

Coding sequence divergence 

Coding sequence divergence was measured by non- 
synonymous (Ka) and synonymous (Ks) substitution 
rates. The protein sequences of duplicate genes were 
aligned using ClustalW [47] with default parameters.
Then, the protein sequence alignment was converted to a
coding sequence alignment using the "Bio::Align::Utilities"
module in the BioPerl package (http://www.bioperl.org/).
Finally, Ka and Ks were calculated using the Yang & Nielsen
method [48] via the "Bio::Tools::Run::Phylo::PAML::Yn00"
module in the BioPerl package.

Gene expression data 

Gene expression data generated from the Affymetrix 
Arabidopsis ATH1 Genome Array (GPL198) were 
obtained from previous studies [26,49]. The expression 
divergence between duplicated genes was measured by 
1 - r, where r is Pearson's correlation coefficient between
their expression profiles [26].
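
The expression divergence measure can be sketched as below, assuming each gene's expression profile is a list of values over the same ordered set of samples. The function name and the toy profiles are hypothetical. The measure ranges from 0 for perfectly correlated profiles to 2 for perfectly anti-correlated ones.

```python
import math
from typing import Sequence


def expression_divergence(x: Sequence[float], y: Sequence[float]) -> float:
    """Return 1 - r, with r the Pearson correlation of two expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return 1.0 - sxy / math.sqrt(sxx * syy)


# Toy profiles across four hypothetical samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # perfectly co-expressed with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # perfectly anti-correlated with gene_a
print(expression_divergence(gene_a, gene_b))
print(expression_divergence(gene_a, gene_c))
```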

Additional file 



Additional file 1: Arabidopsis duplicated genes of different modes. 